---
title: "Colombo Amenity Distribution: Point Pattern Analysis & Spatial Regression"
author: "Nimesh Akalanka"
date: today
format:
html:
toc: true
toc-depth: 3
toc-float: true
code-fold: false
code-tools: true
theme: cosmo
self-contained: true
resource-embed: true
execute:
warning: false
message: false
echo: true
---
# Introduction
This notebook explores the spatial distribution of educational and healthcare facilities across the Colombo district in Sri Lanka. The analysis is structured into two main sections: Point Pattern Analysis and Spatial Regression Analysis.
# User Stories
## User Story 1: Urban Planner
::: {.callout-note appearance="minimal"}
**Priya Jayawardena** | Age 34 | Urban Planner
Urban Development Authority (UDA), Colombo, Sri Lanka
:::
### Responsibilities
- Develop and review spatial development plans for Colombo district at the GND level.
- Identify gaps in public facility coverage and recommend locations for new infrastructure investments.
- Coordinate with the Ministry of Education to align school and kindergarten placement with population growth patterns.
- Prepare evidence-based reports to support budget allocation decisions for educational infrastructure.
### Experiences
- Has extensive experience in facility gap analysis and spatial planning for public services.
- Familiar with GIS-based tools for mapping and analyzing urban service coverage.
- Previously worked on school rationalization projects identifying overcrowded vs. underutilized facilities.
### Challenges
- Lacks fine-grained spatial data on how educational facilities are distributed relative to
population density at the GND level.
- Difficult to justify new facility placements to stakeholders without quantitative evidence
of clustering or underservice.
- Existing planning reports rely on administrative counts rather than per-capita or
density-adjusted measures, which misrepresent actual access gaps.
### Goals
- Identify which GNDs in Colombo are statistically underserved by educational facilities
on a per-capita basis.
- Use spatial regression findings to understand what infrastructure and demographic
factors drive the current uneven distribution of schools and kindergartens.
- Produce a prioritized list of GNDs for new educational facility investment backed by
spatial evidence.
---
# 1 Data Loading & Preparation
These steps include the preparation of spatial data for the analysis of Colombo District. The task involves loading relevant datasets, converting the coordinate system to EPSG:32644 for proper dimension and area calculations, and merging demographic data with building information. The goal is to calculate attributes related to population demographics, housing, road infrastructure, and building area.
- **Population Groups:**
- `Population_0_14`: Number of people aged 0-14.
- `Population_15_59`: Number of people aged 15-59.
- `Population_60_64`: Number of people aged 60-64.
- `Population_Above_65`: Number of people aged above 65.
- `Male`: Total number of males.
- `Female`: Total number of females.
- `Total_Population`: Total number of people (Male + Female).
- **Housing and Employment:**
- `Occupied_Housing_Units`: Number of occupied housing units.
- `Poverty_Percentage`: Percentage of the population living in poverty.
- `Unemployment_Rate`: Unemployment rate in the area.
- **Infrastructure:**
- `Road_Density`: Density of roads in the area.
- `Road_Length`: Total length of roads in the area.
- **Geospatial Calculations:**
- `Population_Density`: Population density per square kilometer.
- `Building_Area`: Area of buildings (units: square meters)
Furthermore all the datasets are clipped in to the Colombo district boundary to ensure that the analysis is focused on the relevant geographic area.
```{r data_loading}
# Load necessary libraries
library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)
library(tmap)
library(terra)
library(spatstat)
library(spdep)
library(spatialreg)
library(knitr)
library(kableExtra)
library(patchwork)
# Improve figure clarity in rendered output
knitr::opts_chunk$set(
fig.align = "center",
fig.width = 9,
fig.height = 6,
dpi = 180,
out.width = "100%"
)
# Select an projected coordinate system and set up the output directory
utm_crs <- 32644
output_dir <- "output"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
gpkg_path <- file.path(output_dir, "output_layers.gpkg")
# Load Sri Lankan GND boundaries
sl_gnds <- st_read("data/sl_gnds.shp", quiet = TRUE) %>%
st_make_valid() %>%
st_transform(utm_crs)
# Consider only Colombo district for this analysis
names(sl_gnds)
unique(sl_gnds$DISTRICT_N)
cmb_gnds <- sl_gnds %>% filter(DISTRICT_N == "Colombo")
cmb_boundary <- st_union(cmb_gnds)
# Read the census population data from census csv file
cmb_census <- read.csv("data/cmb_census_2024.csv")
# Join the census population data to the Colombo GNDs
# Consider both the GND and DSD names
cmb_gnds <- cmb_gnds %>%
mutate(
GND_N = trimws(toupper(GND_N)),
DSD_N = trimws(toupper(DSD_N))
) %>%
left_join(
cmb_census %>%
mutate(
GND_Name = trimws(toupper(GND_Name)),
DSD_Name = trimws(toupper(DSD_Name)),
AG_0_14 = as.numeric(gsub(",", "", trimws(AG_0_14))),
AG_15_59 = as.numeric(gsub(",", "", trimws(AG_15_59))),
AG_60_64 = as.numeric(gsub(",", "", trimws(AG_60_64))),
AG_65_and_above = as.numeric(gsub(",", "", trimws(AG_65_and_above))),
Male = as.numeric(gsub(",", "", trimws(Male))),
Female = as.numeric(gsub(",", "", trimws(Female))),
Total = as.numeric(gsub(",", "", trimws(Total))),
Occupied_Housing_Units = as.numeric(gsub(",", "", trimws(Occupied_Housing_Units))),
Poverty_Percentage = as.numeric(gsub(",", "", trimws(Poverty_Percentage))),
Unemployment_Rate = as.numeric(gsub(",", "", trimws(Unemployment_Rate))),
Road_Density = as.numeric(gsub(",", "", trimws(Road_Density))),
Road_Length = as.numeric(gsub(",", "", trimws(Road_Length)))
) %>%
select(
DSD_Name, GND_Name, AG_0_14, AG_15_59, AG_60_64, AG_65_and_above, Male, Female, Total,
Occupied_Housing_Units, Poverty_Percentage, Unemployment_Rate, Road_Density, Road_Length
),
by = c("DSD_N" = "DSD_Name", "GND_N" = "GND_Name")
) %>%
mutate(
GND_N = tools::toTitleCase(tolower(GND_N)),
DSD_N = tools::toTitleCase(tolower(DSD_N)),
Area_sqkm = as.numeric(st_area(geometry)) / 1e6,
Pop_density_sqkm = Total / Area_sqkm
)
# Check how many GNDs were successfully matched
cat("GNDs with population data:", sum(!is.na(cmb_gnds$Total)), "of", nrow(cmb_gnds), "\n")
# Load Sri Lankan Building footprints
cmb_build <- st_read("data/sl_buildings.shp", quiet = TRUE) %>%
st_transform(utm_crs) %>%
st_intersection(cmb_boundary)
# Calculate total building footprint area per GND (Unit: square meters)
building_area_by_gnd <- cmb_build %>%
mutate(building_area = as.numeric(st_area(.))) %>%
st_join(cmb_gnds %>% select(GND_N)) %>%
st_drop_geometry() %>%
group_by(GND_N) %>%
summarise(Building_area = sum(building_area, na.rm = TRUE), .groups = "drop")
# Join building area back to cmb_gnds
cmb_gnds <- cmb_gnds %>%
left_join(building_area_by_gnd, by = "GND_N") %>%
mutate(Building_area = coalesce(Building_area, 0))
# Save the file to GeoPackage
st_write(cmb_build, gpkg_path, layer = "cmb_build", delete_layer = TRUE, quiet = TRUE)
# Show the information about the loaded buildings and POI data
cat("Number of GNDs loaded:", nrow(cmb_gnds), "\n")
```
In this study mainly 02 main amenity categories are considered in the OSM dataset under the fclass attribute:
- **Education Facilities:**
- `school`
- `university`
- `college`
- `kindergarten`
- **Healthcare Facilites:**
- `hospital`
- `clinic`
- `pharmacy`
- `doctors`
- `dentist`
- `veterinary`
- `nursing_home`
Then other POI categories are filtered out and only the above two categories are retained for the analysis.
```{r classify_pois}
# Import Colombo POI data as points
cmb_poi <- st_read("data/sl_pois.shp", quiet = TRUE) %>%
st_transform(utm_crs) %>%
st_intersection(cmb_boundary)
# Inspect the POI attribute names
print(names(cmb_poi))
# Classify POIs based on fclass
cmb_poi <- cmb_poi %>%
mutate(
amenity_type = case_when(
fclass %in% c("school", "university", "college", "kindergarten") ~ "education",
fclass %in% c("hospital", "clinic", "pharmacy", "doctors", "dentist", "veterinary", "nursing_home") ~ "healthcare",
TRUE ~ NA_character_
)) %>% filter(!is.na(amenity_type))
# Perform a spatial join to associate GND_N with POIs
cmb_poi_with_gnd <- st_join(cmb_poi, cmb_gnds %>% select(GND_N))
# Check if the GND_N is correctly added to the POI data
print(names(cmb_poi_with_gnd))
head(cmb_poi_with_gnd)
# Count education and healthcare POIs per GND
poi_counts_by_gnd <- cmb_poi_with_gnd %>%
st_drop_geometry() %>%
group_by(GND_N, amenity_type) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(names_from = amenity_type, values_from = count, values_fill = list(count = 0))
# Keep only final count columns for a stable join schema
poi_counts_by_gnd <- poi_counts_by_gnd %>%
transmute(
GND_N,
education_count = as.integer(education),
healthcare_count = as.integer(healthcare)
)
# Check the resulting table
print(names(poi_counts_by_gnd))
print(head(poi_counts_by_gnd))
# Join the POI category counts with density, count per 1000 people
cmb_gnds <- cmb_gnds %>%
select(-any_of(c("education", "healthcare", "education_count", "healthcare_count",
"education_density", "healthcare_density",
"education_per1000", "healthcare_per1000"))) %>%
left_join(poi_counts_by_gnd, by = "GND_N") %>%
mutate(
education_count = coalesce(education_count, 0L),
healthcare_count = coalesce(healthcare_count, 0L),
education_density = education_count / Area_sqkm,
healthcare_density = healthcare_count / Area_sqkm,
education_per1000 = (education_count / Total) * 1000,
healthcare_per1000 = (healthcare_count / Total) * 1000
)
# Save the classified POIs to GeoPackage
st_write(cmb_poi, gpkg_path, layer = "cmb_poi", delete_layer = TRUE, quiet = TRUE)
st_write(cmb_gnds, gpkg_path, layer = "cmb_gnds", delete_layer = TRUE, quiet = TRUE)
```
---
# 2 Visualize the POI Distribution
Visualize the two amenity categories (educational and healthcare) across Colombo.
```{r visualization_poi, fig.height=10}
# Split the data based on amenity categories
education <- cmb_poi %>% filter(amenity_type == "education")
healthcare <- cmb_poi %>% filter(amenity_type == "healthcare")
# Define a function to create maps
make_map <- function(pts, title, dot_color) {
ggplot() +
geom_sf(data = cmb_gnds, fill = "#f7f7f7", color = "#d9d9d9", linewidth = 0.25) +
geom_sf(data = st_union(cmb_gnds), fill = NA, color = "#333333", linewidth = 0.6) +
geom_sf(
data = pts,
color = dot_color,
size = 1.7,
alpha = 0.75,
shape = 16
) +
labs(
title = title,
subtitle = paste0("n = ", nrow(pts), " points")
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(
face = "bold",
hjust = 0,
size = 14,
margin = margin(b = 3)
),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0, margin = margin(b = 8)),
plot.background = element_rect(fill = "white", color = NA),
plot.margin = margin(10, 10, 10, 10)
) +
coord_sf(expand = FALSE)
}
# Plot the map
m1 <- make_map(education, "Educational Facilities", "#2166ac")
m2 <- make_map(healthcare, "Healthcare Facilities", "#d6604d")
m1 / m2 +
plot_annotation(
title = "Amenity Distribution across Colombo District",
caption = "Source: OpenStreetMap | CRS: UTM Zone 44N",
theme = theme(
plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
plot.caption = element_text(color = "grey50", size = 9, hjust = 1)
)
)
```
---
# 3 Amenity Distribution Analysis
Are the educational, healthcare facilities in Colombo distributed randomly, clustered or regularly distributed?
Since spatstat requries a observation window, Colombo boundary is used as the observation window.
## 3.1 Quadrat Count Analysis
```{r point_pattern_analysis, fig.height=10}
library(viridis)
# Defining the observation window for the amenity categories
education_ppp <- as.ppp(education, W = as.owin(cmb_boundary))
healthcare_ppp <- as.ppp(healthcare, W = as.owin(cmb_boundary))
# Test the quadrant count for each amenity
education_quadrant <- quadratcount(education_ppp, nx = 10, ny = 10)
healthcare_quadrant <- quadratcount(healthcare_ppp, nx = 10, ny = 10)
# Plot the point distribution in the quardrants
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 1), mar = c(3.8, 3.8, 3, 1.5), bg = "white")
plot(education_quadrant, main = "Educational Facilities | Quadrat Count", col = viridis(10), legend = TRUE, main.cex = 1.2)
plot(st_geometry(education), add = TRUE, col = adjustcolor("grey70", alpha.f = 0.4), pch = 16)
plot(as.owin(cmb_boundary), add = TRUE, border = "grey20", lwd = 1)
plot(healthcare_quadrant, main = "Healthcare Facilities | Quadrat Count", col = viridis(10),legend = TRUE, main.cex = 1.2)
plot(st_geometry(healthcare), add = TRUE, col = adjustcolor("grey70", alpha.f = 0.4), pch = 16)
plot(as.owin(cmb_boundary), add = TRUE, border = "grey20", lwd = 1)
par(op)
```
## 3.2 Chi-squared Test for Complete Spatial Randomness (CSR)
Let's do a chi-squared test to see if the observed distribution of amenities in the quadrants follow a complete spatial randomness (CSR) pattern or do they show clustering or regularity.
```{r chi_squared_test_edu}
# Test the quadrat counts for each amenity category
quadrat.test(education_quadrant)
```
p-value is less than 0.05 (if we consider a significance of 5%), reject the null hypothesis of complete spatial randomness (CSR) for educational facilities, indicating that they are not randomly distributed across Colombo. The observed distribution is likely to be clustered or regular. The magnitude of the test statistic constitute strong evidence against randomness, indicates an extremely uneven distribution, where educational facilities are heavily concentrated in a small number of quadrats while large portions of Colombo have very few or none at all. This points to significant spatial clustering of educational facilities across the district.
```{r chi_squared_test_health}
# Test the quadrat counts for each amenity category
quadrat.test(healthcare_quadrant)
```
The p-value is less than 0.05 (at a 5% significance level), we reject the null hypothesis of Complete Spatial Randomness (CSR) for healthcare facilities, indicating that they are not
randomly distributed across Colombo. The observed distribution is likely to be clustered or regular. Although the test statistic (X² = 808.99) is considerably smaller than that of educational facilities, it still constitutes strong evidence against randomness, indicating an uneven distribution where healthcare facilities are concentrated in certain quadrats while other parts of Colombo remain comparatively underserved. This points to significant spatial clustering of healthcare facilities across the district.
## 3.3 Density Analysis
Perform a kernel density estimation to visualize the intensity of the given amenity categories see their visual differences in terms of spatial distribution across Colombo.
```{r kde, fig.height=10}
# Calculate the kernel density
education_dens <- density(education_ppp, sigma = bw.diggle)
healthcare_dens <- density(healthcare_ppp, sigma = bw.diggle)
# Plot the kernel density
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 1), mar = c(3.8, 3.8, 3, 1.5), bg = "white")
plot(education_dens, main = "Educational Facilities KDE", col = viridis(100))
plot(as.owin(cmb_boundary), add = TRUE, border = "grey20", lwd = 1)
plot(healthcare_dens, main = "Healthcare Facilities KDE", col = viridis(100))
plot(as.owin(cmb_boundary), add = TRUE, border = "grey20", lwd = 1)
par(op)
# Plot the GND wise Density Map
p1 <- ggplot() +
geom_sf(data = cmb_gnds, aes(fill = education_density),
color = "#d9d9d9", linewidth = 0.15) +
geom_sf(data = st_union(cmb_gnds), fill = NA,
color = "#333333", linewidth = 0.6) +
scale_fill_viridis_c(
option = "viridis",
name = "Facilities\nper km²",
na.value = "grey90",
trans = "sqrt"
) +
labs(
title = "Educational Facility Density",
subtitle = "Facilities per km² by GND"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0,
margin = margin(b = 3)),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0,
margin = margin(b = 8)),
plot.background = element_rect(fill = "white", color = NA),
plot.margin = margin(10, 10, 10, 10),
legend.position = "right"
) +
coord_sf(expand = FALSE)
p2 <- ggplot() +
geom_sf(data = cmb_gnds, aes(fill = healthcare_density),
color = "#d9d9d9", linewidth = 0.15) +
geom_sf(data = st_union(cmb_gnds), fill = NA,
color = "#333333", linewidth = 0.6) +
scale_fill_viridis_c(
option = "viridis",
name = "Facilities\nper km²",
na.value = "grey90",
trans = "sqrt"
) +
labs(
title = "Healthcare Facility Density",
subtitle = "Facilities per km² by GND"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0,
margin = margin(b = 3)),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0,
margin = margin(b = 8)),
plot.background = element_rect(fill = "white", color = NA),
plot.margin = margin(10, 10, 10, 10),
legend.position = "right"
) +
coord_sf(expand = FALSE)
p1 / p2 +
plot_annotation(
title = "GND wise Facility Density across Colombo District",
caption = "Source: OpenStreetMap | Area-adjusted counts | CRS: UTM Zone 44N",
theme = theme(
plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
plot.caption = element_text(color = "grey50", size = 9, hjust = 1)
)
)
```
## 3.4 Selecting an Appropriate Correlation Method
Let's see is the distribution of educational and healthcare facilities are similar accross Colombo district. First check that what type of distribution the KDE values of both categories follw.
```{r normality_test_education}
#| code-fold: true
set.seed(13244)
# Convert spatstat density objects to terra rasters
education_rast <- rast(education_dens)
healthcare_rast <- rast(healthcare_dens)
# Resample healthcare raster to match education raster grid exactly
healthcare_rast_resampled <- resample(healthcare_rast, education_rast)
# Extract raster cell values
education_vals <- values(education_rast)
healthcare_vals <- values(healthcare_rast_resampled)
# Remove NA cells (cells outside the observation window)
valid <- !is.na(education_vals) & !is.na(healthcare_vals)
cat("Valid raster cells used:", sum(valid), "\n")
# Shapiro-Wilk requires n <= 5000
sample_idx <- sample(which(valid), min(5000, sum(valid)))
# Plot the histogram to visualize the distribution of intensity values for both categories
hist(education_vals[valid],
breaks = 50,
main = "Education KDE",
xlab = "Intensity",
col = "#2166ac80",
border = "white")
# Test on raw values of education
shapiro_edu <- shapiro.test(education_vals[sample_idx])
cat("Education KDE: W =", round(shapiro_edu$statistic, 4), "| p =", shapiro_edu$p.value, "\n")
```
Since the p-value of the Shapiro-Wilk test is 3.614476e-85, which is significantly less than 0.05, null hypothesis of normality distribution can be rejected. This indicates that the distribution of educational facility intensity across Colombo significantly deviates from normality, suggesting a non-normal distribution.
```{r normality_test_healthcare}
#| code-fold: true
# Plot the histogram
hist(healthcare_vals[valid],
breaks = 50,
main = "Healthcare KDE",
xlab = "Intensity",
col = "#d6604d80",
border = "white")
# Shapiro Wilk test for the KDE values of healthcare facilities
shapiro_hlth <- shapiro.test(healthcare_vals[sample_idx])
cat("Healthcare KDE: W =", round(shapiro_hlth$statistic, 4), "| p =", shapiro_hlth$p.value, "\n")
```
Also the healthcare KDE values have p-value of p = 4.798521e-88, which is significantly less than 0.05, which says that the healthcare facility intensity distribution across Colombo also significantly deviates from normality, indicating a non-normal distribution.
Both educational and healthcare facilities show non-normal distributions in their spatial intensity across the district. Therefore using the Pearson R correlation coefficient to measure the relationship between the two categories may not be appropriate.
## 3.5 Correlation Analysis
Since both educational and healthcare facility intensity distributions are confirmed to be non-normal, Spearman's rank correlation is used as the primary measure of spatial association. The analysis is conducted at two spatial scales to provide a comprehensive understanding of the relationship between the two amenity types across Colombo.
```{r spearman_kde}
#| code-fold: true
# Pixel wise KDE Correlation
# Spearman's rank correlation test
spearman_kde <- cor.test(
education_vals[valid],
healthcare_vals[valid],
method = "spearman"
)
# GND wise KDE Correlation
# Spearman's rank correlation test
spearman_gnd <- cor.test(
cmb_gnds$education_density,
cmb_gnds$healthcare_density,
method = "spearman"
)
# Summary of the correlation results
cat(" - Pixel wise Spearman Correlation rho:", round(spearman_kde$estimate, 4), "\n")
cat(" - GND wise Spearman Correlation rho:", round(spearman_gnd$estimate, 4), "\n")
```
| Scale | rho | Strength |
|---|---|---|---|
| Pixel wise KDE | 0.5544 | Moderate positive |
| GND wise Density | 0.3229 | Moderate positive |
At the pixel level, there is a moderate positive correlation between the intensity of education and healthcare facilities, this reflect that areas with higher concentration of educational facilities also tend to have higher concentration of healthcare facilities as well.
Additionally, when the data is aggregated at the GND level, the correlation is also moderate positive, which suggests that when looking at GND level densities, there is still tendency for GNDs with higher education facilities also tend to have higher healthcare facilities as well. However the correlation is weaker at GND level compared to pixel level.
---
# 4 Spatial Regression Analysis
I want to understand the factors that influence the distribution of education and healthcare facilities across Colombo district in terms of GND level attributes.
## 4.1 Data Exploration and Preparation
First see the distribution of education and healthcare facilities per 1000 people across GNDs
```{r per1000_map, fig.height=10}
# Define break points and labels
edu_breaks <- c(0, 0.001, 0.5, 1.0, 2.0, max(cmb_gnds$education_per1000, na.rm = TRUE))
hlt_breaks <- c(0, 0.001, 0.5, 1.0, 2.0, max(cmb_gnds$healthcare_per1000, na.rm = TRUE))
# Plot the GND wise map for education and healthcare facilities per 1000 people
p1 <- ggplot() +
geom_sf(data = cmb_gnds, aes(fill = education_per1000),
color = "#d9d9d9", linewidth = 0.15) +
geom_sf(data = st_union(cmb_gnds), fill = NA,
color = "#333333", linewidth = 0.6) +
scale_fill_stepsn(
colours = RColorBrewer::brewer.pal(5, "Blues"),
breaks = c(0.001, 0.5, 1.0, 2.0),
limits = c(0, max(cmb_gnds$education_per1000, na.rm = TRUE)),
labels = c("Very Low (<0.5)", "Low (0.5–1)", "Medium (1–2)", "High (>2)"),
name = "Education Facilities\nper 1000 people",
na.value = "grey90",
guide = guide_coloursteps(show.limits = FALSE)
) +
labs(
title = "Education Facilities per 1000 People",
subtitle = "Education facilities per 1000 people by GND"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0,
margin = margin(b = 3)),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0,
margin = margin(b = 8)),
plot.background = element_rect(fill = "white", color = NA),
plot.margin = margin(10, 10, 10, 10),
legend.position = "right"
) +
coord_sf(expand = FALSE)
p2 <- ggplot() +
geom_sf(data = cmb_gnds, aes(fill = healthcare_per1000),
color = "#d9d9d9", linewidth = 0.15) +
geom_sf(data = st_union(cmb_gnds), fill = NA,
color = "#333333", linewidth = 0.6) +
scale_fill_stepsn(
colours = RColorBrewer::brewer.pal(5, "Reds"),
breaks = c(0.001, 0.5, 1.0, 2.0),
limits = c(0, max(cmb_gnds$healthcare_per1000, na.rm = TRUE)),
labels = c("Very Low (<0.5)", "Low (0.5–1)", "Medium (1–2)", "High (>2)"),
name = "Healthcare Facilities\nper 1000 people",
na.value = "grey90",
guide = guide_coloursteps(show.limits = FALSE)
) +
labs(
title = "Healthcare Facilities per 1000 People",
subtitle = "Healthcare facilities per 1000 people by GND"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0,
margin = margin(b = 3)),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0,
margin = margin(b = 8)),
plot.background = element_rect(fill = "white", color = NA),
plot.margin = margin(10, 10, 10, 10),
legend.position = "right"
) +
coord_sf(expand = FALSE)
p1 / p2 +
plot_annotation(
title = "Amenity Accessibility across Colombo District",
caption = "Source: OpenStreetMap & Census 2024 | CRS: UTM Zone 44N",
theme = theme(
plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
plot.caption = element_text(color = "grey50", size = 9, hjust = 1)
)
)
```
Let's prepare regression datasets for both educational and healthcare facilities.
Since spatial models assume linearity log transformations are applied to skewed data such as population density, road density and occupied housing units.
Additionally, demographic percentages are calculated to capture the age structure of the population and also for the built up area, which can influence the demand for the type of amenities this study considers.
```{r regression_data_preparation}
# Prepare the regression dataset
cmb_reg <- cmb_gnds %>%
filter(
!is.na(Total), Total > 0,
!is.na(Pop_density_sqkm),
!is.na(Poverty_Percentage),
!is.na(Unemployment_Rate),
!is.na(Road_Density),
!is.na(Building_area),
!is.na(Occupied_Housing_Units)
) %>%
mutate(
log_pop_density = log1p(Pop_density_sqkm),
log_road_density = log1p(Road_Density),
log_occ_housing = log1p(Occupied_Housing_Units),
pct_builtup_area = (Building_area / Area_sqkm * 100),
pct_builtup_area = scale(pct_builtup_area)[,1],
pct_young = (AG_0_14 / Total) * 100,
pct_elderly = (AG_65_and_above / Total * 100),
pct_working_age = ((AG_15_59 + AG_60_64) / Total * 100)
)
# Check the number of GNDs in the Regression dataset
cat("GNDs for regression:", nrow(cmb_reg), "\n")
```
## 4.2 Build Spatial Weights Matrix
To analyze the spatial relationships, definition of spatial weights matrix is needed. In this study a queen contiguity-based spatial weights matrix is considered because this is a administrative boundary which any of the adjust administrative units can be considered as a neighbor and influence each other.
```{r spatial_weights}
#| code-fold: true
# Create a spatial weights matrix using queen contiguity
nb_queen <- poly2nb(cmb_reg, queen = TRUE)
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)
# Check the neighbour structure
cat(
"Total GNDs:", length(nb_queen), "\n",
"Average number of neighbours:", round(mean(card(nb_queen)), 2), "\n",
"GNDs with no neighbours:", sum(card(nb_queen) == 0), "\n",
"Min neighbours:", min(card(nb_queen)), "\n",
"Max neighbours:", max(card(nb_queen)), "\n"
)
```
This spatial weights matrix contains 556 GNDs with an average of 5.58 neighbours per GND. There is no GND with zero neighbours, which means all GNDs have at least one adjacent GND. The maximum number of neighbours is 13.
## 4.3 Define Spatial Regression Formulas
The 02 formulas for spatial regression models are defined with same set of independent variables to analyze the factors influencing the distribution of education and healthcare facilities across Colombo district.
```{r spatial_regression_formulas}
# Formular for education facilities per 1000 people
formula_education <- formula(
education_per1000 ~ log_pop_density + Poverty_Percentage +
Unemployment_Rate + log_road_density + log_occ_housing + pct_builtup_area +
pct_young + pct_elderly
)
# Formular for healthcare facilities per 1000 people
formula_healthcare <- formula(
healthcare_per1000 ~ log_pop_density + Poverty_Percentage +
Unemployment_Rate + log_road_density + log_occ_housing + pct_builtup_area +
pct_young + pct_elderly
)
```
## 4.4 OLS Baseline Models
First, OLS regression models are fitted for both educational and healthcare facilities to establish a baseline understanding of the relationships between the independent variables and the dependent variables without considering the spatial autocorrelation.
```{r ols_models}
#| code-fold: true
# OLS regression models for both education and healthcare facilities
ols_education <- lm(formula_education, data = cmb_reg)
ols_healthcare <- lm(formula_healthcare, data = cmb_reg)
# Display the summary of the OLS models
summary(ols_education)
summary(ols_healthcare)
```
OLS baseline model shows that for education facilities per 1000 people these factors considered in the formular accounts for only 15%, for the healthcare facilities per 1000 people the R-squared value is 20.56, which means the model only explains about 20.56% of the changes in healthcare facilities distribution across GNDs.
## 4.5 Moran's I Test on OLS Residuals
Run the Moran's I test on the residuals of the OLS models to see if there is any spatial autocorrelation present. If there is significant spatial autocorrelation in the residuals, it shows that the OLS model doesnt fully capture the spatial processes underline with the distribution of the education and healthcare amenities.
```{r morans_i_ols}
# Moran's I test on OLS residuals to check for spatial autocorrelation
moran_education <- lm.morantest(ols_education, lw_queen, zero.policy = TRUE)
moran_healthcare <- lm.morantest(ols_healthcare, lw_queen, zero.policy = TRUE)
# Display the results of Moran's I test
moran_education
moran_healthcare
```
The Moran's I test results shows significant spatial autocorrelation in the education facilities per 1000 people value of 0.25, which tell that there is a unexplained variation in the distribution of education facilities across GNDs.
Healthcare facilities show week clustering Moran's I = 0.073, p < 0.001 confirming that even though the spatial structure is weaker, it is not negligible and cannot be attributed to chance.
In both cases, the null hypothesis of spatial randomness is rejected, meaning OLS assumptions are violated and spatial regression models are necessary for reliable inference.
## 4.6 Lagrange Multiplier Diagnostics
```{r lm_diagnostics}
#| code-fold: true
lm_diag_education <- lm.LMtests(
ols_education, lw_queen,
test = c("LMlag", "LMerr", "RLMlag", "RLMerr"),
zero.policy = TRUE
)
lm_diag_healthcare <- lm.LMtests(
ols_healthcare, lw_queen,
test = c("LMlag", "LMerr", "RLMlag", "RLMerr"),
zero.policy = TRUE
)
lm_diag_education
lm_diag_healthcare
```
The robust Lagrange Multiplier diagnostics indicate that for both facility types the spatial dependence operates through the error term. The robust spatial error test is significant for both education (adjRSerr = 8.037, p = 0.005) and healthcare (adjRSerr = 8.138, p = 0.004), while the robust spatial lag test is not significant in either case. A Spatial Error Model is therefore selected for both outcomes, suggesting the spatial clustering is driven by unmeasured background factors influencing the distribution of amenities rather than direct spillover effects between GNDs.
## 4.7 Spatial Error Models
Since the residual autocorrelation is positive for both models, a spatial error model is fitted to account for spatial autocorrelation in the error terms, which is likely due to unobserved spatially structured factors influencing the distribution of education and healthcare facilities across GNDs.
```{r sem_models}
#| code-fold: true
sem_education <- errorsarlm(
formula_education,
data = cmb_reg,
listw = lw_queen,
zero.policy = TRUE
)
sem_healthcare <- errorsarlm(
formula_healthcare,
data = cmb_reg,
listw = lw_queen,
zero.policy = TRUE
)
summary(sem_education)
summary(sem_healthcare)
```
**Education SEM:** Road accessibility and built-up intensity are the strongest drivers of education facility distribution, with better connected and more urbanised GNDs having significantly more facilities. Denser GNDs are relatively underserved on a per-capita basis. The strong spatial error parameter (λ = 0.452) indicates that unmeasured spatially structured factors account for a significant share of the clustering pattern.
**Healthcare SEM:** Road accessibility and built-up intensity similarly drive healthcare facility distribution. However, the most striking finding is that GNDs with higher proportions of young residents are significantly underserved in healthcare, suggesting a demographic mismatch in provision. The weaker spatial error parameter (λ = 0.190) reflects that healthcare distribution is somewhat less spatially structured than education, consistent with the weaker Moran's I found earlier.
Across both models, population density shows a consistent negative relationship, denser GNDs tend to have fewer facilities per capita, pointing to a potential equity concern in Colombo's most densely populated areas.
## 4.8 Validate SEM Residuals
```{r sem_residuals}
#| code-fold: true
# Morans'I test on SEM residuals to check if spatial autocorrelation is adequately accounted for
moran_sem_edu <- moran.test(residuals(sem_education), lw_queen, zero.policy = TRUE)
moran_sem_hlt <- moran.test(residuals(sem_healthcare), lw_queen, zero.policy = TRUE)
moran_sem_edu
moran_sem_hlt
```
The Moran's I test on SEM residuals confirms that both models have successfully removed the spatial autocorrelation present in the OLS residuals. Education SEM residuals show a Moran's I of -0.027 (p = 0.839) and healthcare SEM residuals show -0.006 (p = 0.570), both well above the 0.05 significance threshold. The slightly negative Moran's I values are ideal, they indicate the residuals are now pure noise with no remaining spatial pattern, confirming that the Spatial Error Models have adequately captured the underlying spatial structure in both facility distributions.
## 4.9 Plot Residual Map
```{r residual_maps, fig.height=10}
#| code-fold: true
# Add SEM residual to the regression dataset for mapping
cmb_reg <- cmb_reg %>%
mutate(
resid_edu = residuals(sem_education),
resid_hlt = residuals(sem_healthcare)
)
# Plot the Map
m1 <- ggplot() +
geom_sf(data = cmb_reg, aes(fill = resid_edu),
color = "#d9d9d9", linewidth = 0.15) +
geom_sf(data = st_union(cmb_reg), fill = NA,
color = "#333333", linewidth = 0.6) +
scale_fill_distiller(
palette = "RdBu",
name = "Residual",
direction = 1,
limits = c(-max(abs(cmb_reg$resid_edu), na.rm = TRUE), max(abs(cmb_reg$resid_edu), na.rm = TRUE))
) +
labs(
title = "Education per 1000 - SEM Residuals",
subtitle = "No spatial pattern = model is valid"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 13, hjust = 0),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0),
plot.background = element_rect(fill = "white", color = NA)
)
m2 <- ggplot() +
geom_sf(data = cmb_reg, aes(fill = resid_hlt),
color = "#d9d9d9", linewidth = 0.15) +
geom_sf(data = st_union(cmb_reg), fill = NA,
color = "#333333", linewidth = 0.6) +
scale_fill_distiller(
palette = "RdBu",
name = "Residual",
direction = 1,
limits = c(-max(abs(cmb_reg$resid_hlt), na.rm = TRUE), max(abs(cmb_reg$resid_hlt), na.rm = TRUE))
) +
labs(
title = "Healthcare per 1000 - SEM Residuals",
subtitle = "No spatial pattern = model is valid"
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 13, hjust = 0),
plot.subtitle = element_text(size = 10, color = "#5f5f5f", hjust = 0),
plot.background = element_rect(fill = "white", color = NA)
)
m1 / m2 +
plot_annotation(
title = "Spatial Model Residual Maps, Colombo District",
caption = "Source: OpenStreetMap & Census 2024 | CRS: UTM Zone 44N",
theme = theme(
plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
plot.caption = element_text(color = "grey50", size = 9, hjust = 1)
)
)
```
Education map: Almost entirely white/near-zero residuals across the district with only a couple of isolated blue GNDs. No geographic clustering pattern is visible, residuals are randomly scattered.
Healthcare map: Similarly random mix of very light red and blue with no systematic geographic pattern. The values are small and spread without any clear directional clustering.
Both maps visually confirm what the Moran's I validation already told us statistically, the spatial structure has been fully captured by the models.